A core process in human cognition is analogical mapping: the ability to identify a similar relational structure between different situations. We introduce a novel task, Visual Analogies of Situation Recognition, adapting the classical word-analogy task into the visual domain. Given a triplet of images, the task is to select an image candidate B' that completes the analogy (A to A' is like B to what?). Unlike previous work on visual analogy that focused on simple image transformations, we tackle complex analogies requiring understanding of scenes. We leverage situation recognition annotations and the CLIP model to generate a large set of 500k candidate analogies. Crowdsourced annotations for a sample of the data indicate that humans agree with the dataset label ~80% of the time (chance level 25%). Furthermore, we use human annotations to create a gold-standard dataset of 3,820 validated analogies. Our experiments demonstrate that state-of-the-art models do well when distractors are chosen randomly (~86%), but struggle with carefully chosen distractors (~53%, compared to 90% human accuracy). We hope our dataset will encourage the development of new analogy-making models. Website: https://vasr-dataset.github.io/
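As a rough illustration of the task format (A to A' is like B to what?, with four candidates giving the 25% chance level), the sketch below applies the classical word-analogy vector arithmetic to toy embeddings. This is an assumed baseline for illustration, not necessarily a method evaluated in the paper; the vectors are hypothetical values rather than real CLIP features, and `solve_analogy` is a made-up helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(a, a_prime, b, candidates):
    """Return the index of the candidate closest to b + (a_prime - a)."""
    target = [bi + (api - ai) for ai, api, bi in zip(a, a_prime, b)]
    scores = [cosine(target, c) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


# Toy 3-d "embeddings" (hypothetical values, not real image features).
a = [1.0, 0.0, 0.0]        # image A
a_prime = [1.0, 1.0, 0.0]  # image A': differs from A in the second dimension
b = [0.0, 0.0, 1.0]        # image B
candidates = [
    [0.0, 0.0, 1.0],   # distractor: B unchanged
    [0.0, 1.0, 1.0],   # B with the same change applied
    [1.0, 0.0, 0.0],   # distractor: copy of A
    [0.0, -1.0, 1.0],  # distractor: opposite change
]
best = solve_analogy(a, a_prime, b, candidates)
print(best)
```

In this toy setup the second candidate is selected because it matches the shifted target exactly; carefully chosen distractors would sit much closer to that target than the ones above.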
translated by Google Translate
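While vision-and-language models perform well on tasks such as visual question answering, they struggle with basic human commonsense reasoning skills. In this work, we introduce WinoGAViL: an online game for collecting vision-and-language associations (e.g., werewolves to a full moon), used as a dynamic benchmark to evaluate state-of-the-art models. Inspired by the popular card game Codenames, a spymaster gives a textual cue related to several visual candidates, and another player has to identify them. Human players are rewarded for creating associations that are challenging for a rival AI model but still solvable by other human players. We use the game to collect 3.5K instances, finding that they are intuitive for humans (>90% Jaccard index) but challenging for state-of-the-art AI models, where the best model (ViLT) achieves a score of 52%, succeeding mostly where the cue is visually salient. Our analysis, as well as the feedback we collect from players, indicates that the collected associations require diverse reasoning skills, including general knowledge, common sense, abstraction, and more. We release the dataset, the code, and the interactive game, aiming to allow future data collection that can be used to develop models with better association abilities.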
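The Jaccard index quoted above compares the set of candidates a solver selects with the set the spymaster intended. A minimal sketch of the standard set definition follows; the labels are hypothetical, and how the benchmark aggregates scores across instances is not stated here.

```python
def jaccard_index(predicted, gold):
    """Size of the intersection divided by size of the union of two sets."""
    predicted, gold = set(predicted), set(gold)
    union = predicted | gold
    if not union:
        # Two empty selections are treated as a perfect match (a convention assumed here).
        return 1.0
    return len(predicted & gold) / len(union)


# Hypothetical example: the cue is associated with two candidates,
# and the solver gets one of them right.
gold = {"moon", "forest"}
predicted = {"moon", "dog"}
score = jaccard_index(predicted, gold)
print(score)
```

One correct pick out of three distinct labels overall gives a score of one third, so partially correct selections earn partial credit.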
Neural Representations have recently been shown to effectively reconstruct a wide range of signals from 3D meshes and shapes to images and videos. We show that, when adapted correctly, neural representations can be used to directly represent the weights of a pre-trained convolutional neural network, resulting in a Neural Representation for Neural Networks (NeRN). Inspired by coordinate inputs of previous neural representation methods, we assign a coordinate to each convolutional kernel in our network based on its position in the architecture, and optimize a predictor network to map coordinates to their corresponding weights. Similarly to the spatial smoothness of visual scenes, we show that incorporating a smoothness constraint over the original network's weights aids NeRN towards a better reconstruction. In addition, since slight perturbations in pre-trained model weights can result in a considerable accuracy loss, we employ techniques from the field of knowledge distillation to stabilize the learning process. We demonstrate the effectiveness of NeRN in reconstructing widely used architectures on CIFAR-10, CIFAR-100, and ImageNet. Finally, we present two applications using NeRN, demonstrating the capabilities of the learned representations.
Monocular Depth Estimation (MDE) is a fundamental problem in computer vision with numerous applications. Recently, LIDAR-supervised methods have achieved remarkable per-pixel depth accuracy in outdoor scenes. However, significant errors are typically found in the proximity of depth discontinuities, i.e., depth edges, which often hinder the performance of depth-dependent applications that are sensitive to such inaccuracies, e.g., novel view synthesis and augmented reality. Since direct supervision for the location of depth edges is typically unavailable in sparse LIDAR-based scenes, encouraging the MDE model to produce correct depth edges is not straightforward. In this work we propose to learn to detect the location of depth edges from densely-supervised synthetic data, and use it to generate supervision for the depth edges in the MDE training. Despite the 'domain gap' between synthetic and real data, we show that depth edges that are estimated directly are significantly more accurate than the ones that emerge indirectly from the MDE training. To quantitatively evaluate our approach, and due to the lack of depth edges ground truth in LIDAR-based scenes, we manually annotated subsets of the KITTI and the DDAD datasets with depth edges ground truth. We demonstrate significant gains in the accuracy of the depth edges with comparable per-pixel depth accuracy on several challenging datasets.
Micron-scale robots (ubots) have recently shown great promise for emerging medical applications, and accurate control of ubots is a critical next step to deploying them in real systems. In this work, we develop the idea of a nonlinear mismatch controller to compensate for the mismatch between the disturbed unicycle model of a rolling ubot and trajectory data collected during an experiment. We exploit the differential flatness property of the rolling ubot model to generate a mapping from the desired state trajectory to nominal control actions. Due to model mismatch and parameter estimation error, the nominal control actions will not exactly reproduce the desired state trajectory. We employ a Gaussian Process (GP) to learn the model mismatch as a function of the desired control actions, and correct the nominal control actions using a least-squares optimization. We demonstrate the performance of our online learning algorithm in simulation, where we show that the model mismatch makes some desired states unreachable. Finally, we validate our approach in an experiment and show that the error metrics are reduced by up to 40%.
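The differential-flatness mapping mentioned above can be sketched for the standard, undisturbed unicycle model (x' = v cos θ, y' = v sin θ, θ' = ω), where the position (x, y) serves as the flat output. This is a textbook illustration under that assumption: the disturbance terms, the Gaussian Process correction, and the least-squares step from the abstract are not shown, and `flat_to_controls` is a hypothetical helper name.

```python
import math


def flat_to_controls(xd, yd, xdd, ydd):
    """Map derivatives of the flat output (x, y) to (v, theta, omega).

    xd, yd   : first time derivatives of the desired position
    xdd, ydd : second time derivatives of the desired position
    """
    speed_sq = xd * xd + yd * yd
    if speed_sq == 0.0:
        raise ValueError("heading is undefined at zero speed")
    v = math.sqrt(speed_sq)                    # forward speed
    theta = math.atan2(yd, xd)                 # heading angle
    omega = (xd * ydd - yd * xdd) / speed_sq   # turning rate
    return v, theta, omega


# Desired trajectory: a circle of radius R traversed at angular rate w,
# x(t) = R cos(w t), y(t) = R sin(w t), evaluated at time t.
R, w, t = 2.0, 0.5, 1.0
xd = -R * w * math.sin(w * t)
yd = R * w * math.cos(w * t)
xdd = -R * w * w * math.cos(w * t)
ydd = -R * w * w * math.sin(w * t)

v, theta, omega = flat_to_controls(xd, yd, xdd, ydd)
print(v, theta, omega)
```

For the circular trajectory the nominal speed equals R·w and the nominal turning rate equals w, which is a quick sanity check on the mapping. With model mismatch, applying these nominal controls would not reproduce the circle exactly, which is the gap the learned correction is meant to close.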
A master face is a face image that passes face-based identity authentication for a high percentage of the population. These faces can be used to impersonate, with a high probability of success, any user, without having access to any user information. We optimize these faces for 2D and 3D face verification models, by using an evolutionary algorithm in the latent embedding space of the StyleGAN face generator. For 2D face verification, multiple evolutionary strategies are compared, and we propose a novel approach that employs a neural network to direct the search toward promising samples, without adding fitness evaluations. The results we present demonstrate that it is possible to obtain a considerable coverage of the identities in the LFW or RFW datasets with fewer than 10 master faces, for six leading deep face recognition systems. In 3D, we generate faces using the 2D StyleGAN2 generator and predict a 3D structure using a deep 3D face reconstruction network. When employing two different 3D face recognition systems, we are able to obtain a coverage of 40%-50%. Additionally, we present the generation of paired 2D RGB and 3D master faces, which simultaneously match 2D and 3D models with high impersonation rates.
The use of needles to access sites within organs is fundamental to many interventional medical procedures both for diagnosis and treatment. Safe and accurate navigation of a needle through living tissue to an intra-tissue target is currently often challenging or infeasible due to the presence of anatomical obstacles in the tissue, high levels of uncertainty, and natural tissue motion (e.g., due to breathing). Medical robots capable of automating needle-based procedures in vivo have the potential to overcome these challenges and enable an enhanced level of patient care and safety. In this paper, we show the first medical robot that autonomously navigates a needle inside living tissue around anatomical obstacles to an intra-tissue target. Our system leverages an aiming device and a laser-patterned highly flexible steerable needle, a type of needle capable of maneuvering along curvilinear trajectories to avoid obstacles. The autonomous robot accounts for anatomical obstacles and uncertainty in living tissue/needle interaction with replanning and control and accounts for respiratory motion by defining safe insertion time windows during the breathing cycle. We apply the system to lung biopsy, which is critical in the diagnosis of lung cancer, the leading cause of cancer-related death in the United States. We demonstrate successful performance of our system in multiple in vivo porcine studies and also demonstrate that our approach leveraging autonomous needle steering outperforms a standard manual clinical technique for lung nodule access.
The field of emergent communication aims to understand the characteristics of communication as it emerges from artificial agents solving tasks that require information exchange. Communication with discrete messages is considered a desired characteristic, for both scientific and applied reasons. However, training a multi-agent system with discrete communication is not straightforward, requiring either reinforcement learning algorithms or relaxing the discreteness requirement via a continuous approximation such as the Gumbel-softmax. Both these solutions result in poor performance compared to fully continuous communication. In this work, we propose an alternative approach to achieve discrete communication -- quantization of communicated messages. Using message quantization allows us to train the model end-to-end, achieving superior performance in multiple setups. Moreover, quantization is a natural framework that runs the gamut from continuous to discrete communication. Thus, it sets the ground for a broader view of multi-agent communication in the deep learning era.
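To make the "continuous to discrete" spectrum concrete, the sketch below quantizes a message vector to a configurable number of evenly spaced levels. It assumes message values in [0, 1] and uniform levels, neither of which is specified in the abstract, and it leaves out how gradients pass through the rounding step during end-to-end training. `quantize` is a hypothetical helper name.

```python
def quantize(message, levels):
    """Round each value (clipped to [0, 1]) to the nearest of `levels` evenly spaced values."""
    if levels < 2:
        raise ValueError("need at least two levels")
    step = 1.0 / (levels - 1)
    return [round(min(max(x, 0.0), 1.0) / step) * step for x in message]


message = [0.07, 0.49, 0.93]   # hypothetical continuous message

binary = quantize(message, 2)    # coarse: each symbol is 0 or 1
fine = quantize(message, 101)    # fine: close to the continuous message

print(binary)
print(fine)
```

With two levels the message collapses to binary symbols, while with many levels it stays within half a step of the original values, so the single `levels` parameter moves between the discrete and near-continuous regimes.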
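In video analysis, background models have many applications, such as background/foreground separation, change detection, anomaly detection, and tracking. However, while learning such a model from a video captured by a static camera is a fairly well-solved task, in the case of a Moving-Camera Background Model (MCBM) success has been far more modest, owing to algorithmic and scalability challenges that arise from the camera motion. Thus, existing MCBMs are limited in their scope and in the types of camera motion they support. These hurdles have also impeded the use, in this unsupervised task, of end-to-end solutions based on deep learning (DL). Moreover, existing MCBMs usually model the background either on the domain of a typically large panoramic image or in an online fashion. Unfortunately, the former creates several problems, including poor scalability, while the latter prevents the recognition and exploitation of cases where the camera revisits previously seen parts of the scene. This paper proposes a new method, called DeepMCBM, that eliminates all of the aforementioned issues and achieves state-of-the-art results. Concretely, we first identify the difficulties associated with joint alignment of video frames, both in general and in a DL setting. Next, we propose a new strategy for joint alignment that lets us use a spatial transformer net with neither regularization nor any form of specialized (and non-differentiable) initialization. Coupled with an autoencoder conditioned on unwarped robust central moments (obtained from the joint alignment), this yields an end-to-end, regularization-free MCBM that supports a broad range of camera motions and scales gracefully. We demonstrate DeepMCBM's utility on a variety of videos, including ones beyond the scope of other methods. Our code is available at https://github.com/bgu-cs-vil/deepmcbm.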
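This paper presents the final results of the Out-of-Vocabulary 2022 (OOV) challenge. The OOV contest introduces an important aspect that is not commonly studied by Optical Character Recognition (OCR) models, namely, the recognition of scene text instances unseen at training time. The competition compiles a collection of public scene text datasets comprising 326,385 images with 4,864,405 scene text instances, thus covering a wide range of data distributions. A new, independent validation and test set is formed from scene text instances that are out of vocabulary at training time. The competition was structured in two tasks: end-to-end and cropped scene text recognition. A thorough analysis of the results from baselines and from the different participants is presented. Interestingly, current state-of-the-art models show a significant performance gap under the newly studied setting. We conclude that the OOV dataset proposed in this challenge will be an essential area to explore in order to develop scene text models that achieve more robust and generalized predictions.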